ATLA Dialogue Analysis Revisit

Given that I performed my last analysis as an exercise to literally teach myself how to use Python and Matplotlib, I wanted to clean it up, abstracting some of the distracting code into a helper file and using some new data visualization libraries I've come across. I also wanted to try my hand at transforming my web crawler from a synchronous program into an asynchronous one, which cuts the scraping runtime by roughly 50% (check out the Performance Enhancements section below).

Scrape Data

Find Appropriate Script Source

90% of the battle in data analysis/visualization is selecting the appropriate dataset. Unfortunately, the website I used last time is now defunct (I sure hope my previous webscraper didn't kill them...). However, Wikipedia now hosts the scripts on its servers, and presumably these should be more robust.

Dialogue

You may notice that the function below uses await: by modifying the HTML-grabbing functions to be asynchronous instead of synchronous, we actually reduce the amount of time spent web scraping by 50%-70%, depending on the internet connection at the time the function is called.

See the Performance Enhancements section at the end of this analysis.

Visualize Data (New and Improved!)

I've learned about some new libraries since last time and think that the data trends I tried visualizing in matplotlib lineplots could be more interactive and informative.

On the wordcloud front, I can use my space more efficiently by organizing the wordclouds into a grid instead of scrolling endlessly as they plot as individual figures.
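The grid layout itself is mostly bookkeeping: figure out how many rows a fixed column count implies, then render each wordcloud into its own subplot axis. A minimal sketch of that arithmetic (the `grid_shape` helper and the 4-column default are illustrative choices of mine, not part of any library):

```python
import math

def grid_shape(n_items: int, n_cols: int = 4) -> tuple[int, int]:
    """Rows and columns needed to fit n_items in a grid n_cols wide."""
    return math.ceil(n_items / n_cols), n_cols

# e.g. 10 character wordclouds in a 4-wide grid need 3 rows
print(grid_shape(10))  # (3, 4)

# In the plotting code, each cloud would then land in its own axis,
# roughly: fig, axes = plt.subplots(*grid_shape(10)), followed by
# axes.flat[i].imshow(cloud_image) for each character's cloud.
```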

Charts

A few thoughts I had about how I could improve over the previous iteration of graphs:

  1. Use an interactive library like hvplot; the graphs have hover functionality and, in my opinion, look a little prettier. Note that there seems to be an issue with hvplot's stacked bar functionality when it comes to modifying the hover labels for stacked bar graphs.
  2. Become more intentional with the data I'm analyzing. Previously I analyzed lines of dialogue, but a line could be a single word or, in the case of ATLA's "The Tales of Ba Sing Se", an entire haiku. Counting actual words of dialogue gives a more accurate picture of which characters speak the most (and presumably get the most screen time).
  3. Standardize the data prior to analysis if I'm looking to make comparisons. I don't really care whether one character speaks 100 words if characters 2, 3, and 4 are all speaking 300; I care that they're speaking one-third the value of the others, and that character 1 accounts for only 10% of the episode's total dialogue.
  4. Use a more appropriate graph type to represent the data (e.g. stepped area chart).
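Points 2 and 3 above boil down to just a few lines of Python: count words rather than lines, then normalize each character's count against the episode total. A toy sketch with made-up dialogue (the sample lines are for illustration only, not the real transcript data):

```python
# A "line" can be one word or a whole haiku, so count words, not lines.
# Sample data for illustration only -- not the real transcripts.
lines = {
    "Aang": ["Hey Katara!", "Monk Gyatso was the greatest airbender in the world."],
    "Sokka": ["Drink cactus juice! It'll quench ya!"],
}

word_counts = {name: sum(len(line.split()) for line in speech)
               for name, speech in lines.items()}

# Standardize: each character's share of the total dialogue, so episodes
# of different lengths can be compared directly.
total_words = sum(word_counts.values())
word_share = {name: n / total_words for name, n in word_counts.items()}

print(word_counts)  # {'Aang': 11, 'Sokka': 6}
```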

A few interesting things that this view shows us that were harder to see in the previous iterations of the graphs:

  1. Zuko actually gets two anti-hero-centric episodes over the course of the series; in addition to the previously mentioned Zuko Alone, he also features heavily in Season 3's The Beach. Personally, I feel like this strategy of giving Zuko such significant and humanizing episodes is a key reason why ATLA is remembered so fondly even now.
  2. Despite being one of the fandom's favorite characters, Toph averages just over half the words per active episode of the rest of the primary Gaang; she speaks about 155 words per active episode, while Sokka and Aang sit around 300 (Katara lags slightly at 242).
  3. Sokka actually speaks the most over the course of the series - not Aang. Our moonsword boy squeaks out a win!
  4. I personally remember Suki having a much more pronounced role in the plot than her appearances (and dialogue frequency within those appearances) would indicate. The numbers don't lie here, so shout-outs are in order for Suki's VA as well as the writing team and animators for making the most of her limited screen time.

Wordclouds

Behind the Scenes: Performance Enhancements

While most of the code was overhauled between the previous analysis and this one, the bulk of the change was driven by necessity: the previous script website no longer exists, and scraping the Wikipedia dialogue tables required a different approach.

The core exception to this is in pulling the script data. Whereas the prior analysis used a synchronous process to pull each page's script, the code has been updated to use an asynchronous approach, improving execution speed. In other words, with the previous web crawler I could not pull script 2 until I had finished pulling script 1; with the new approach, I can pull scripts 1 and 2 at the same time!
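The actual crawler awaits real HTTP requests, but the concurrency win is easiest to see with a toy stand-in where asyncio.sleep simulates network latency (everything below is an illustration, not the real scraping code):

```python
import asyncio
import time

async def fetch_script(episode: str, delay: float = 0.1) -> str:
    # Stand-in for an HTTP request; real code would await an HTTP client here.
    await asyncio.sleep(delay)
    return f"<html>script for {episode}</html>"

async def fetch_all(episodes: list[str]) -> list[str]:
    # gather() starts every fetch at once, so total wall time is roughly
    # one delay instead of len(episodes) delays back to back.
    return await asyncio.gather(*(fetch_script(ep) for ep in episodes))

episodes = [f"episode_{i}" for i in range(10)]

start = time.perf_counter()
pages = asyncio.run(fetch_all(episodes))
elapsed = time.perf_counter() - start

print(f"fetched {len(pages)} pages in {elapsed:.2f}s")  # ~0.1s, not ~1.0s
```

With a synchronous loop, the ten simulated fetches would take ten delays in a row; gathering them concurrently takes about one.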

The performance increase results in the new code executing 2-4x as fast as the previous iteration; the exact speedup fluctuates (presumably internet connection plays a factor, as web scraping is involved). Compare the runtime values in the execution profiles between the old method (usually 40-60s) and the new method (~15s):

*Note: Performance times may be slightly inflated, as they were measured over a mobile hotspot. Proportions should still be comparable.*